Topic Detection and Tracking using idf-Weighted Cosine Coefficient

Authors

  • J. Michael Schultz
  • Mark Liberman
Abstract

The goal of TDT Topic Detection and Tracking is to develop automatic methods of identifying topically related stories within a stream of news media. We describe approaches for both detection and tracking based on the well-known idf-weighted cosine coefficient similarity metric. The surprising outcome of this research is that we achieved very competitive results for tracking using a very simple method of feature selection, without word stemming and without a score normalization scheme. The detection task results were not as encouraging, though we attribute this more to the clustering algorithm than to the underlying similarity metric.

1. The Tracking Task

The goal of the topic tracking task for TDT2 is to identify news stories on a particular event defined by a small number (Nt) of positive training examples and a greater number of negative examples. All stories in the news stream subsequent to the final positive example are to be classified as on-topic if they pertain to the event or off-topic if they do not. Although the task is similar to IR routing and filtering tasks, the definition of "event" leads to at least one significant difference. An event is defined as an occurrence at a given place and time covered by the news media. Stories are on-topic if they cover the event itself or any outcome (strictly defined in [2]) of the event. By this definition, all stories prior to the occurrence are off-topic, which, contrary to the IR tasks mentioned, theoretically provides unlimited off-topic training material (assuming retrospective corpora are available). We expected to be able to take advantage of these unlimited negative examples, but in our final implementation did so only to the extent that we used a retrospective corpus to improve the term statistics of our database.

1.1. idf-Weighted Cosine Coefficient

As the basis for our approach we used the idf-weighted cosine coefficient described in [1], often referred to as tf-idf.
Using this metric, the tracking task becomes two-fold. First, choosing an optimal set of features to represent topics, i.e. feature selection; the approach must choose features from a single story as well as from multiple stories (for Nt > 1). Second, determining a threshold (potentially one per topic) which optimizes the miss and false alarm probabilities for a particular cost function, effectively normalizing the similarity scores across topics.

The cosine coefficient is a document similarity metric which has been investigated extensively. Here documents (and queries) are represented as vectors in an n-dimensional space, where n is the number of unique terms in the database. The coefficients of the vector for a given document are the term frequencies (tf) for that dimension. The resulting vectors are extremely sparse, and typically high-frequency words (mostly closed-class) are ignored. The cosine of the angle between two vectors is an indication of vector similarity and is equal to the dot product of the vectors normalized by the product of the vector lengths:

    cos(theta) = (A . B) / (||A|| ||B||)

tf-idf (term frequency times inverse document frequency) weighting is an ad hoc modification to the cosine coefficient calculation which weights words according to their usefulness in discriminating documents. Words that appear in few documents are more useful than words that appear in many documents. This is captured in the equation for the inverse document frequency of a word:

    idf(w) = log10(N / df(w))

where df(w) is the number of documents in a collection which contain word w and N is the total number of documents in the collection. For our implementation we weighted only the topic vector by idf and left the story vector under test unchanged. This allows us to calculate and fix an idf-scaled topic vector immediately after training on the last positive example story for a topic.
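The asymmetric weighting described above (idf applied to the topic vector only, story vector left as raw term frequencies) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the toy document collection and the function names are invented for the example.

```python
import math
from collections import Counter

def idf(word, docs):
    """Inverse document frequency: log10(N / df(w)),
    where df(w) counts the documents containing the word."""
    df = sum(1 for d in docs if word in d)
    return math.log10(len(docs) / df) if df else 0.0

def similarity(topic_tf, story_tf, docs):
    """Cosine coefficient in which only the topic vector's terms are
    scaled by idf; both norms are computed over the raw tf vectors."""
    num = sum(tf * story_tf[w] * idf(w, docs) for w, tf in topic_tf.items())
    norm_a = math.sqrt(sum(v * v for v in topic_tf.values()))
    norm_b = math.sqrt(sum(v * v for v in story_tf.values()))
    return num / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy collection: each background document reduced to its set of words.
docs = [{"earthquake", "peru", "aid"}, {"election", "peru"}, {"aid", "relief"}]
topic = Counter({"earthquake": 2, "peru": 1})
story = Counter({"earthquake": 1, "aid": 1})
```

Because the idf scaling of the topic vector depends only on collection statistics, it can be computed once after the last positive training story, as the paper notes.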
The resulting calculation for the similarity measure becomes:

    sim(a, b) = [ sum_{w=1..n} tf_a(w) tf_b(w) idf(w) ] / [ sqrt(sum_{w=1..n} tf_a(w)^2) sqrt(sum_{w=1..n} tf_b(w)^2) ]

1.2. UPENN System Attributes

To facilitate testing, the stories were loaded into a simple document processing system. Once in the system, stories are processed in chronological order, testing all topics simultaneously with a single pass over the data (in accordance with the evaluation specification for this project [2], no information is shared across topics), at a rate of approximately 6000 stories per minute on a 266 MHz Pentium machine. The system tokenizer delimits on white space and punctuation (and discards it) and collapses case, but provides no stemming. A list of 179 stop words consisting almost entirely of closed-class words was also employed. In order to improve word statistics, particularly for the beginning of the test set, we prepended a retrospective corpus (the TDT Pilot Data [3]) of approximately 16 thousand stories.

1.3. Feature Selection

The choice as well as the number of features (words) used to represent a topic has a direct effect on the trade-off between miss and false alarm probabilities. We investigated four methods of producing lists of features sorted by their effectiveness in discriminating a topic. This then allowed us to easily vary the number of those features for the topic vectors. (We did not employ feature selection on the story under test but used its text in entirety.)

1. Keep all features except those words belonging to the stop word list.

2. Relative to the training stories, sort words by document count and keep the n most frequent. This approach has the advantage of finding those words which are common across training stories, and therefore are more general to the topic area, but has the disadvantage of extending poorly from the Nt = 16 case to the Nt = 1 case.

3. For each story, sort words by term count (tf) and keep the n most frequent. While this approach tends to ignore low-count words which occur in multiple training documents, it generalizes well from the Nt = 16 to the Nt = 1 case.

4.
As a variant on the previous method, we tried adding to the initial n features using a simple greedy algorithm. Against a database containing all stories up to and including the Nt-th training story, we queried the database with the n features plus the next most frequent term. If the separation of on-topic and off-topic stories increased, we kept the term; if not, we ignored it and tested the next term in the list. We defined separation as the difference between the average of the on-topic scores and the average of the 20 highest-scoring off-topic scores.

Of the feature selection methods we tried, the fourth yielded the best results across varying values of Nt, although only slightly better than the much simpler third method. Occam's Razor prompted us to omit this complication from the algorithm. The DET curves (see [5] for a detailed description of DET curves) in Figure 1 show the effect of varying the number of features (obtained from method 3) on the miss and false alarm probabilities. The uppermost right curve results from choosing the single most frequent feature. Downward to the left, in order, are the curves for 5, 10, 50, 150 and 300 features. After examining similar plots from the pilot, training (the first two-month period of TDT2 data is called the training set, not to be confused with training data), and development-test data sets, we set the number of features for our system to 50. It can be seen that there is limited benefit in adding features after this point.

1.4. Normalization / Threshold Selection

With a method of feature selection in place, a threshold for the similarity score must be determined above which stories will be deemed on-topic, and below which they will not. Since each topic is represented by its own unique vector, it cannot be expected that the same threshold value will be optimal across all topics unless the scores are normalized. We tried two approaches for normalizing the topic similarity scores.
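The greedy variant (method 4) can be sketched as below. The `separation` callback stands in for querying the story database and computing avg(on-topic scores) minus avg(top-20 off-topic scores); the toy scoring function used in the example is invented purely to exercise the loop.

```python
def greedy_select(candidates, n, separation):
    """Sketch of the greedy variant: start from the n most frequent terms
    (method 3), then keep each subsequent term only if it increases the
    on-topic/off-topic separation reported by `separation`."""
    features = list(candidates[:n])
    best = separation(features)
    for term in candidates[n:]:
        sep = separation(features + [term])
        if sep > best:  # keep the term only when separation improves
            features.append(term)
            best = sep
    return features

# Toy stand-in for the database query: terms in `relevant` help,
# and every extra feature costs a little.
relevant = {"quake", "peru", "rescue"}
toy_sep = lambda feats: sum(f in relevant for f in feats) - 0.1 * len(feats)
```

For example, `greedy_select(["quake", "news", "peru", "today", "rescue"], 2, toy_sep)` keeps "peru" and "rescue" but rejects "today", since only the former two raise the separation score.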
For the first approach we calculated the similarity of a random sample of several hundred off-topic documents in order to estimate an average off-topic score relative to the topic vector. The normalized score is then a function of the average on-topic score (calculated from the training stories), the average off-topic score, and the off-topic standard deviation. (The on-topic standard deviation is unreliable for small Nt, but for larger Nt the off-topic standard deviation was found to be a good approximation of it.)

[Figure 1: DET curve for varying number of features (random performance and curves for 1, 5, 10, 50, 150 and 300 features; miss probability in % vs. false alarm probability in %). Nt = 4, TDT2 evaluation data set, newswire and ASR transcripts.]

The second approach looked at only the highest-scoring off-topic stories returned from a query of the topic vector against a retrospective database, with the score normalized in a similar fashion to the first approach. Both attempts reduced the story-weighted miss probability by approximately 10 percent at low false alarm probabilities. However, this result was achieved at the expense of a higher miss probability at higher false alarm probabilities, and a higher cost at the operating point defined by the cost function for the task (defined in [2]):

    C_track = C_miss * P(miss) * P_topic + C_fa * P(fa) * (1 - P_topic)

where
    C_miss = 1 (the cost of a miss),
    C_fa = 1 (the cost of a false alarm),
    P_topic = 0.02 (the a priori probability of a story being on a given topic, chosen based on the TDT2 training topics and training corpus).

Because of the less optimal trade-off between error probabilities at the point defined by the cost function, we chose to ignore normalization and look directly at cost as a function of a single threshold value across all topics.
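Evaluating the tracking cost at an operating point is a direct application of the equation above; a minimal sketch, with the TDT2 constants as defaults:

```python
def tracking_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_topic=0.02):
    """C_track = C_miss * P(miss) * P_topic + C_fa * P(fa) * (1 - P_topic)."""
    return c_miss * p_miss * p_topic + c_fa * p_fa * (1.0 - p_topic)
```

Note how the small prior P_topic = 0.02 makes the cost far more sensitive to false alarms than to misses, which is why a normalization scheme that trades higher miss probability for fewer false alarms at the operating point can still lose.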
We plotted tf-idf score against story- and topic-weighted cost for the training and development-test data sets. As our global threshold we averaged the scores at which story- and topic-weighted cost were minimized. This is depicted in Figure 2. Figure 3 shows the same curves for the evaluation data set. The threshold resulting from the training and development-test data applies satisfactorily, though far from optimally, to the evaluation data set. An optimal threshold of 39 would have improved the topic-weighted score by 17.6 percent and the story-weighted cost by 1.9 percent.
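The single-global-threshold selection can be sketched as a sweep over candidate thresholds, scoring each with the tracking cost. This simplified version pools all (score, label) pairs and minimizes one unweighted cost rather than averaging the story- and topic-weighted minima as the paper does; the example data are toy values.

```python
def best_threshold(scored, candidates, c_miss=1.0, c_fa=1.0, p_topic=0.02):
    """Pick the global threshold minimizing tracking cost.
    `scored` is a list of (similarity_score, is_on_topic) pairs."""
    on = [s for s, rel in scored if rel]
    off = [s for s, rel in scored if not rel]
    def cost(t):
        p_miss = sum(s < t for s in on) / len(on)     # on-topic below threshold
        p_fa = sum(s >= t for s in off) / len(off)    # off-topic at/above it
        return c_miss * p_miss * p_topic + c_fa * p_fa * (1.0 - p_topic)
    return min(candidates, key=cost)

scored = [(0.9, True), (0.7, True), (0.6, False), (0.2, False)]
```

Here `best_threshold(scored, [0.1, 0.65, 0.95])` selects 0.65, the only candidate separating the toy on-topic and off-topic scores perfectly.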



Publication date: 1999